1 Getting ready

1.1 Installing R

Go to this link to download R: https://cran.rstudio.com/

Select the version that works for your operating system, and download the latest release (R-3.6.0 at the time of writing).


Figure 1.1: Download R.

Once you’ve downloaded R, install it following the instructions on the screen.

1.2 Installing R Studio

Go to this link to download R Studio: https://www.rstudio.com/products/rstudio/download/#download

Then download the version that works for your operating system.


Figure 1.2: Download R Studio.

Once you’ve downloaded R Studio, install it following the instructions on the screen.

2 Why R?

3 Setting things up

3.1 R Studio

R Studio is a great integrated development environment (IDE) in which you can do all your R coding.

Before we get started, let’s change some of the settings in R Studio first.


Figure 3.1: General preferences.

Make sure that:

  • Restore .RData into workspace at startup is unselected
  • Save workspace to .RData on exit is set to Never

This makes sure that each time we run R Studio, we start with a fresh environment rather than still having variables saved from a previous run (which can cause trouble).

Figure 3.2: Code window preferences.

Make sure that:

  • Soft-wrap R source files is selected

This way you don’t have to scroll horizontally. At the same time, avoid writing long single lines of code. For example, instead of writing code like so:

ggplot(data = diamonds, aes(x = cut, y = price)) +
  stat_summary(fun = "mean", geom = "bar", color = "black", fill = "lightblue", width = 0.85) +
  stat_summary(fun.data = "mean_cl_boot", geom = "linerange", size = 1.5) +
  labs(title = "Price as a function of quality of cut", subtitle = "Note: The price is in US dollars", tag = "A", x = "Quality of the cut", y = "Price")

You may want to write it this way instead:

ggplot(data = diamonds, aes(x = cut, y = price)) +
  # display the means
  stat_summary(fun = "mean",
               geom = "bar",
               color = "black",
               fill = "lightblue",
               width = 0.85) +
  # display the error bars
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               size = 1.5) +
  # change labels
  labs(title = "Price as a function of quality of cut",
       subtitle = "Note: The price is in US dollars", # we might want to change this later
       tag = "A",
       x = "Quality of the cut",
       y = "Price")

This makes it much easier to see what’s going on, and you can easily add comments to individual lines of code.

RStudio makes it easy to write nice code. It figures out where to put the next line of code when you press ENTER. And if things ever get messy, just select the code of interest and hit cmd + i to re-indent the code.

Here are some more tips on how to write nice code in R:

And here is a cheatsheet with more useful information about R Studio:

3.2 Getting help

There are a few different ways to get help in R. You can either put a ? in front of the function you’d like to learn more about, or use the help() function.

?print
help("print")

Tip: To see the help file, hover over a function (or dataset) with the mouse (or select the text) and then press F1.

I recommend using F1 to get to help files – it’s the fastest way!
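Besides ? and F1, base R has a few more helpers that are useful when you only half-remember a function name. The following is a small sketch using functions that ship with R; picking mean and sd as examples is arbitrary.

```r
# List all objects whose name contains "mean"
# (useful when you only remember part of a function name)
mean_functions = apropos("mean")
print(mean_functions)

# Show a function's arguments and default values without opening the help page
args(sd)

# Run the code from the "Examples" section of a help file
example("mean")

# Search across all installed help files (same as typing ??"standard deviation")
# help.search("standard deviation")
```

The last line is commented out because the search can take a moment on installations with many packages.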

R help files can sometimes look a little cryptic. Most R help files have the following sections (copied from here):


Title: A one-sentence overview of the function.

Description: An introduction to the high-level objectives of the function, typically about one paragraph long.

Usage: A description of the syntax of the function (in other words, how the function is called). This is where you find all the arguments that you can supply to the function, as well as any default values of these arguments.

Arguments: A description of each argument. Usually this includes a specification of the class (for example, character, numeric, list, and so on). This section is an important one to understand, because arguments are frequently a cause of errors in R.

Details: Extended details about how the function works, provides longer descriptions of the various ways to call the function (if applicable), and a longer discussion of the arguments.

Value: A description of the class of the value returned by the function.

See also: Links to other relevant functions. In most of the R editors, you can click these links to read the Help files for these functions.

Examples: Worked examples of real R code that you can paste into your console and run.


Here is the help file for the print() function:


Figure 3.3: Help file for the print() function.

The help files in R are often quite cryptic, and it can take some time until they become really helpful. Until then, google things! R has a very active community with a large number of posts on Stack Overflow and other online forums.

3.3 Installing and maintaining packages

What makes R powerful is the large number of packages that have been written for R. You can install a new package like so:

install.packages("tidyverse")

You can also install multiple packages at the same time, by concatenating the package names using the c() function:

install.packages(c("tidyverse","broom"))

To make sure that your packages remain up to date, you can go to Tools > Check for Package Updates ... in R Studio.

Figure 3.4: Checking for package updates.

You can then click Select All and then Install Updates.

Figure 3.5: Installing package updates.

R Studio might ask you to restart your R session before updating the packages.
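If you prefer the console over the menu, something like the following sketch should work. It only checks which packages are missing; the install and update calls are commented out so that nothing is downloaded by accident. The vector needed is just an example list.

```r
# Example list of packages we would like to have (adjust to your needs)
needed = c("tidyverse", "broom")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed = vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs = needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)
}

# Roughly what Tools > Check for Package Updates ... does, without the dialog:
# update.packages(ask = FALSE)
```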

3.4 R Markdown

R Markdown files are a great way of organizing one's code. This tutorial is written using R Markdown! Most importantly, you can put R code straight into your R Markdown file so that you can have everything in one place. Indeed, you can write a full paper in R Markdown if you like (using the package papaja).

There are two main ways of putting code into your R Markdown document. Most often, you will create a code chunk and put the code into that chunk, like so:

a = 1 + 2 
print(a)
[1] 3

You can also evaluate R code in line with other text like so: The value of a is 3. Notice how for the code chunk, we use three backticks (```) and for inline code we use a single backtick (`) together with the letter r at the beginning.

The nice thing about these code chunks is that they show you the output directly underneath the chunk when you run it.
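To see what a chunk and inline code look like in the raw file, here is a minimal sketch that writes a tiny R Markdown document to a temporary folder and prints its source. The file name chunk_demo.Rmd and the document title are made up for illustration.

```r
# Three backticks open and close a code chunk
fence = strrep("`", 3)

rmd_lines = c(
  "---",
  "title: \"Chunk demo\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),   # start of the code chunk
  "a = 1 + 2",
  "print(a)",
  fence,                  # end of the code chunk
  "",
  # inline code: single backticks, starting with the letter r
  "The value of a is `r a`."
)

# Write the file and show its raw source
rmd_file = file.path(tempdir(), "chunk_demo.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")

# To knit the file (assumes the rmarkdown package is installed):
# rmarkdown::render(rmd_file)
```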

And a big advantage of using R Markdown is that you can render the file in different formats by “knitting” it. For example, I’ve created the “.html” file using this R Markdown file. This is a great way of sharing your code with others and contributing to open science this way.

You can find some more information about R Markdown here:

3.5 Some general advice

Before diving into R, here are a few more general tips.

3.5.1 Naming folders and files

I suggest always using lowercase characters and avoiding whitespace in folder and file names. Use "_" or "-" instead of a space. Some programs (e.g. LaTeX) cannot deal with spaces in file paths.
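As a small illustration of this naming convention, here is a sketch of a helper that converts a name to lowercase and replaces runs of whitespace with an underscore. The function clean_file_name() and the example file name are hypothetical, not part of any package.

```r
# Hypothetical helper: lowercase the name and replace whitespace with "_"
clean_file_name = function(file_name) {
  file_name = tolower(file_name)
  gsub(pattern = "[[:space:]]+", replacement = "_", x = file_name)
}

cleaned = clean_file_name("Pilot Data 2019.csv")
print(cleaned)
```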

3.5.2 Always use relative paths

In your R Markdown file, make sure to always use relative paths rather than full paths. For example, notice how I link to the cheatsheets using “../../figures/cheatsheets/rmarkdown-reference.pdf” (relative path) rather than “/Users/tobi/Documents/work/projects_git/r_tutorial/figures/cheatsheets/rmarkdown-reference.pdf” (absolute path).

Using relative paths has the advantage that your collaborators can run code just like you can. If you were to use an absolute path, then your collaborator wouldn’t be able to run the file without changing the path first.
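A relative path is always interpreted relative to the current working directory. The sketch below builds the relative path from the example above with file.path() and shows how to check what it resolves to on your machine. Whether the file exists depends on your own folder layout.

```r
# Build the relative path piece by piece (works on all operating systems)
relative_path = file.path("..", "..", "figures", "cheatsheets",
                          "rmarkdown-reference.pdf")
print(relative_path)

# The folder that relative paths are resolved against
print(getwd())

# What the relative path resolves to from here
# (returned more or less unchanged if it cannot be resolved)
absolute_path = normalizePath(relative_path, mustWork = FALSE)
print(absolute_path)

# Check whether the file is actually there before using it
print(file.exists(relative_path))
```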

3.5.3 Naming variables, functions, etc.

3.5.4 Always load all packages at the top

This way, anyone who opens the script can see right away which packages need to be installed.

3.5.5 Make sure that a script can be executed from top to bottom

For example, you don’t want it to be the case that in order to run code chunk 2, you have to run code chunk 3 first.

3.5.6 Keep your projects organized

This GitHub repository uses a project structure that I like. I recommend keeping data, figures, and code separate. Using the same structure in different projects greatly helps to keep things organized.

3.5.7 Learn keyboard shortcuts!

Learning keyboard shortcuts will speed up your workflow immensely! You can view the default keyboard shortcuts here: Tools > Keyboard Shortcuts Help

You can also modify and add keyboard shortcuts via Tools > Modify Keyboard Shortcuts...

3.5.8 Don’t write past the vertical rule in code blocks

This way, your code will look nice when you knit your R Markdown file into HTML or PDF output.

3.6 R syntax

There are two main ways to code in R, one is called “base R” and the other is called “tidyverse”. The “tidyverse” is a collection of powerful packages that work very well with each other. It’s the modern way of coding in R, and this tutorial uses the tidyverse. That said, it’s still important to know how to write things using “base R”.

This cheatsheet summarizes some of the key aspects of “base R”

3.6.1 The pipe %>%

A key part of coding in the tidyverse is using the pipe operator %>% (pronounced “then”). Let me give a quick example:

a = sum(2, 3)
print(a)
[1] 5

Here, I’ve used the sum() function to calculate the sum of 2 and 3 and assigned the result to the variable a. Using the pipe, I can write this slightly differently:

a = 2 %>% 
  sum(3)
print(a)
[1] 5

What the pipe does is that it takes the result of the first computation, and enters it as a first argument to the next computation. This allows us to string many computations together in the sequence in which we want to do things.
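The benefit becomes clearer with more than one step. The sketch below computes the same thing twice, once with nested function calls (read from the inside out) and once with the pipe (read from top to bottom). It loads magrittr, the package that provides %>%; loading the tidyverse makes the pipe available too. The numbers are arbitrary.

```r
library("magrittr")  # provides the pipe operator %>%

numbers = c(1, 4, 9)

# Nested calls: the first step (sqrt) is buried in the middle
nested = round(sum(sqrt(numbers)), digits = 1)

# Piped: take the numbers, then take the square root, then sum, then round
piped = numbers %>%
  sqrt() %>%
  sum() %>%
  round(digits = 1)

print(nested)
print(piped)
```

Both versions give the same result, 6.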

Tip: The keyboard shortcut for the pipe is cmd + shift + m

4 Doing stuff

4.1 Loading packages

The order in which packages in R are loaded matters!

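# order 1: MASS is loaded last, so select() refers to MASS::select()
library("tidyverse")
library("MASS")

# order 2 (in a fresh R session): tidyverse is loaded last, so select() refers to dplyr::select()
library("MASS")
library("tidyverse")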

Both the MASS package and the tidyverse packages have a function called select(). In R, functions from the package that is loaded later mask functions with the same name from packages that were loaded earlier.

You can refer to a function from a specific package by putting the package name and :: in front of the function name. For example, MASS::select() would use the select() function from the MASS package, while dplyr::select() would use the function from the dplyr package (irrespective of the order in which you’ve loaded the packages). However, adding the package name to a function each time it’s called is cumbersome. That’s why we want to make sure to load the packages whose functions we use most frequently last.

In particular, I suggest always loading library("tidyverse") last because it loads a large number of frequently used functions.
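You can check which package a function currently comes from. The following sketch assumes that the MASS and dplyr packages are installed and uses dplyr directly (rather than the whole tidyverse) to keep the example small. The built-in mtcars data set only serves as a stand-in.

```r
library("MASS")
library("dplyr")  # loaded last, so its select() masks MASS::select()

# Which package does a plain select() come from now?
select_source = environmentName(environment(select))
print(select_source)

# Being explicit always works, whatever the load order
df.small = dplyr::select(mtcars, mpg, cyl)
head(df.small, n = 3)
```

When you load dplyr, R also prints a message that lists the masked functions, which is a useful hint that the load order matters.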

4.2 Importing data

df.data = read_csv(file = "../../data/top2018songs.csv") %>% 
  mutate(rank = 1:nrow(.))

| column | description |
|---|---|
| id | Spotify URI of the song |
| name | Name of the song |
| artists | Artist(s) of the song |
| danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
| energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
| key | The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. |
| loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
| mode | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
| speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
| acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
| instrumentalness | Predicts whether a track contains no vocals. ‘Ooh’ and ‘aah’ sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly ‘vocal’. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
| liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
| valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
| tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
| duration_ms | The duration of the track in milliseconds. |
| time_signature | An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). |

The quickest way to take a look at your data is to hover your mouse over a variable of a data frame, and press F2.
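You can also inspect a data frame from the console. The sketch below uses a small stand-in data frame with made-up values (it is not the Spotify data) so that it runs without the csv file; the same functions work on df.data.

```r
# Stand-in data frame with made-up values, for illustration only
df.example = data.frame(name = c("song a", "song b", "song c"),
                        danceability = c(0.75, 0.59, 0.83),
                        stringsAsFactors = FALSE)

# Add a rank column, as was done for df.data above
df.example$rank = seq_len(nrow(df.example))

# Compact overview: column names, types, and the first values
str(df.example)

# First rows of the data frame
head(df.example, n = 2)

# Summary statistics for one column
summary(df.example$danceability)

# Open the spreadsheet-style viewer in R Studio (interactive sessions only)
# View(df.example)
```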

Here is a cheatsheet with more information about how to import data into R:

  • importing data cheatsheet

4.3 Data visualization

4.3.1 How not to visualize data

We should always take a look at the data first.

include_graphics("../../figures/plots/bad_plot1.png")

Figure 4.1: A not so good plot.

include_graphics("../../figures/plots/bad_plot2.jpg")

Figure 4.2: Another could-be-improved plot.

This second plot reminded me of the following:

include_graphics("../../figures/plots/correlation_aint_causation.png")

Figure 4.3: Correlation is not causation.

Just because two lines look similar, doesn’t mean that anything interesting is going on – it certainly doesn’t mean that the two phenomena represented by the lines are causally connected. For more inspiration check out this site https://www.tylervigen.com/spurious-correlations.

4.3.2 Why you should always visualize your data first


Figure 4.4: The Datasaurus Dozen. While different in appearance, each dataset has the same summary statistics to two decimal places (mean, standard deviation, and Pearson’s correlation).

The data sets in Figure 4.4 all share the same summary statistics. Clearly, the data sets are not the same though.

Tip: Always plot the data first!
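You can try this yourself without downloading anything. The sketch below does not use the Datasaurus Dozen; it uses Anscombe's quartet, a classic example of the same point that ships with R as the anscombe data set (columns x1 to x4 and y1 to y4).

```r
# Means of the four x variables and the four y variables
x_means = sapply(anscombe[, 1:4], mean)
y_means = sapply(anscombe[, 5:8], mean)

# Correlation between each x and its matching y (x1 with y1, x2 with y2, ...)
correlations = mapply(cor, anscombe[, 1:4], anscombe[, 5:8])

# The summary statistics are nearly identical across the four data sets ...
print(round(x_means, 2))
print(round(y_means, 2))
print(round(correlations, 2))

# ... but the plots look completely different
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(x = anscombe[[paste0("x", i)]],
       y = anscombe[[paste0("y", i)]],
       xlab = paste0("x", i),
       ylab = paste0("y", i))
}
```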

Here is the paper from which I took Figure 4.4. It explains how the figures were generated and shows more examples of how summary statistics and some kinds of plots are insufficient to get a good sense of what’s going on in the data.

include_graphics("../../figures/plots/box_violin.gif")

Figure 4.5: Boxplots can be misleading.

4.3.3 Visualizing data using ggplot2

ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) + 
  geom_point()

ggplot(data = df.data,
       mapping = aes(x = danceability,
                     y = valence)) + 
  geom_point() +
  geom_smooth(method = "lm")

ggplot(data = df.data,
       mapping = aes(x = mode,
                     y = valence)) + 
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange") +
  stat_summary(fun = "mean",
               geom = "point",
               size = 3)

Surprising! Songs in a minor key (mode = 0) sound more positive than songs in a major key (mode = 1).

Here is a more involved plot that shows some of the things you can do with ggplot2:

df.plot = df.data %>% 
  mutate(mode = factor(mode,
                       levels = c(0, 1),
                       labels = c("minor", "major")),
         key = factor(key,
                      levels = 0:11,
                      labels = c("C", "C#", "D", "D#",
                                 "E", "F", "F#", "G",
                                 "G#", "A", "A#", "B")))

ggplot(data = df.plot,
       mapping = aes(x = key,
                     y = energy,
                     group = mode,
                     fill = mode)) + 
  # add individual data points 
  geom_point(mapping = aes(color = mode),
             position = position_jitterdodge(dodge.width = 0.7,
                                             jitter.width = 0.1,
                                             jitter.height = 0),
             alpha = 0.3) + 
  # add the error bars 
  stat_summary(fun.data = "mean_cl_boot",
               geom = "linerange",
               position = position_dodge(width = 0.7),
               size = 0.75) +
  # add the mean data points 
  stat_summary(fun = "mean",
               geom = "point",
               shape = 21,
               size = 3,
               position = position_dodge(width = 0.7)) +
  # add the vertical lines
  geom_vline(xintercept = seq(from = 1.5, to = 11.5, by = 1),
             linetype = 2,
             color = "gray80") + 
  # set title and subtitle of plot 
  labs(title = "Energy for songs with different key and mode",
       subtitle = "Energy represents a perceptual measure of intensity and activity.") + 
  # change the y-axis 
  scale_y_continuous(breaks = seq(0.25, 1, 0.25),
                     labels = seq(0.25, 1, 0.25),
                     limits = c(0.25, 1)) +
  # set the fill color 
  scale_fill_brewer(palette = "Set1") +
  # change the plotting theme
  theme_classic() +
  # adjust the text size
  theme(text = element_text(size = 16),
        plot.subtitle = element_text(size = 12))

Here are some cheatsheets with data visualization info:

  • ggplot2 cheatsheet
  • data visualization principles cheatsheet
  • animation cheatsheet

4.4 Data manipulation

Visualizing data is fun! But often, we need to spend quite a bit of time beating data into the right shape first. We want our data to be tidy – a tidy data frame has one row per observation. Once we have a tidy data frame, plotting things using ggplot2 becomes a breeze. Unfortunately, many data files aren’t tidy at all to start off with. For example, if you use Qualtrics to run your experiment, the data output will be far from tidy. So we have to learn how to beat our data into shape.
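To make the idea of a tidy data frame concrete, here is a minimal sketch with made-up data. It assumes the tidyr package (part of the tidyverse) is installed; the data frame and its column names are hypothetical.

```r
library("tidyr")

# Hypothetical "wide" data: one row per participant, one column per condition
df.wide = data.frame(participant = 1:3,
                     condition_a = c(4, 5, 3),
                     condition_b = c(2, 1, 2))

# Tidy version: one row per observation (participant x condition)
df.long = pivot_longer(df.wide,
                       cols = c(condition_a, condition_b),
                       names_to = "condition",
                       values_to = "rating")

print(df.long)
```

In the long format, condition can be mapped directly to an aesthetic in ggplot2 (for example x = condition, y = rating).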

4.4.1 Data transformation

Here is the data transformation cheatsheet:

  • Data transformation cheatsheet

4.4.2 Data wrangling

Here is the data wrangling cheatsheet (data wrangling will take some time to get familiar with):

  • Data wrangling cheatsheet

4.5 Statistics

  • linear model lm()
  • linear mixed effects models lmer()
  • Bayesian models brm() (using library("brms"))
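As a first taste, here is a minimal sketch of fitting a linear model with lm(). It uses the built-in mtcars data set rather than the song data, so it runs without any extra files or packages.

```r
# mtcars ships with R: mpg = miles per gallon, wt = weight of the car
fit = lm(formula = mpg ~ wt, data = mtcars)

# Full model summary (coefficients, standard errors, R squared, ...)
summary(fit)

# Just the estimated intercept and slope
fit_coefs = coef(fit)
print(fit_coefs)

# lmer() from the lme4 package uses the same formula style plus random effects,
# for example (with hypothetical data and variable names):
# lmer(rating ~ condition + (1 | participant), data = df.long)
```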

4.6 Help others help you

  • making reproducible examples
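One building block of a reproducible example is sharing a small piece of your data as code that others can paste into their console. The sketch below shows one way to do this with dput() from base R, using the built-in mtcars data as a stand-in for your own data frame.

```r
# Keep the example small: a few rows and only the relevant columns
example_data = head(mtcars[, c("mpg", "wt")], n = 3)

# dput() prints R code that recreates the object exactly;
# paste this output into your question
dput(example_data)

# Check that the printed code really recreates the data
dput_text = capture.output(dput(example_data))
recreated = eval(parse(text = dput_text))
print(all.equal(recreated, example_data))
```

The reprex package builds on the same idea and also captures the output of your code, which may be worth a look once you post questions regularly.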

5 Where can I learn more?

5.1 Free online books